Massachusetts Institute of Technology 

We introduce a new framework for constructing tests of general 
semiparametric hypotheses which have nontrivial power on the $n^{-1/2}$ 
scale in every direction, and can be tailored to put substantial power 
on alternatives of importance. The approach is based on combining 
test statistics based on stochastic processes of score statistics with 
bootstrap critical values. 

1. Introduction. The practice of statistical testing plays several roles in 
empirical research. These roles range from the careful assessment of the 
evidence against specific scientific hypotheses to the judgment of whether 
an estimated model displays decent goodness of fit to the empirical data. 
The paradigmatic situation we consider is one where the investigator views 
some departures from the hypothesized model as being of primary impor- 
tance, with others of interest if sufficiently gross, but otherwise secondary. 
For instance, low-frequency departures from a signal hypothesized to be 
constant might be considered of interest, even if of low amplitude; while 
high-frequency departures are less so, unless they are of high amplitude. 

The optimal testing of a simple hypothesis against a simple alternative 
is the cornerstone of modern statistical theory. However, there is no clear 
notion of optimality for more complicated situations. The Hajek-Le Cam 
asymptotic theory proved that there exist strong concepts of asymptotic ef- 
ficiency in parametric estimation. These ideas have been extended to semi- 
parametric models — see [3, 14, 22]. However, there is no compelling sense of 
an asymptotically optimum test, in either the parametric or the semipara- 
metric asymptotic theories, save for some simple one-parameter hypotheses. 
We deal exclusively with the "elementary" case of i.i.d. data for ease of 
exposition. Moreover, all our considerations are asymptotic save for illus- 
trative simulations. Generalization of this point of view to the two-sample 
problem, independent nonidentically distributed case, time series, and so on, 
is conceptually not difficult, and may be even simpler because of availability 
of permutation tests. 

The general types of tests that people have constructed fall into one of 
two classes: 

(i) Those which have nonnegligible asymptotic power for departures on 
the $n^{-1/2}$ scale in every possible direction. In the standard problems of testing 
goodness of fit to a single distribution against all alternatives, these are 
the classical tests of Kolmogorov and Cramér-von Mises and their classical 
extensions to the problem of testing fit to a parametric hypothesis on the 
one hand and independence on the other. 

(ii) Those which have trivial asymptotic power at the $n^{-1/2}$ scale in every 
direction. The $\chi^2$ tests with increasing number of cells as $n \to \infty$ are the 
preeminent example of this type, but a number of variants have recently been 
explored through devices such as empirical likelihood — see [10] for recent 
examples. 

Tests of type (ii) have the feature that they have approximately equal 
power in all directions. As a consequence, they can enjoy minimax properties 
over suitable nonparametric families of alternatives — see [15], for example. 
But, as we noted, they pay for this by not having power at rate $n^{-1/2}$ 
in any particular direction. The tests of type (i) have the weakness that 
they concentrate their power at the $n^{-1/2}$ scale in very explicit alternative 
directions, dictated primarily by the metric used, implicitly or explicitly. 
For example, the Kolmogorov test for goodness of fit to the uniform (0, 1) 
distribution is well known to have power mainly against alternatives such 
that $|P(X < 1/2) - 1/2|$ is large. 

The principal reason for limiting oneself to tests of types (i) and (ii) ap- 
pears to have been the need for simple approximations to the critical values 
under the null which need to be coupled with specification of a test statistic 
to implement a test. However, the critical values can be approximated by 
bootstrap methods, as discussed in Section 3.3. 

Our goal in this paper is to show that it is possible to construct tests 
for any semiparametric hypothesis, which have as much power as possible 
at the $n^{-1/2}$ scale in a few directions of interest, specific to the particular 
scientific problem investigated, reserving some power for gross departures 
(in the $n^{-1/2}$ scale) for other directions. 

We clearly do not adopt the minimax and adaptive minimax testing point 
of view of Ingster [15]. Our proposal does not aim at minimaxity and, since 
we concentrate on the $n^{-1/2}$ scale, our tests do not have uniformity properties 
except over relatively small families. We believe that in testing, even more 
than in estimation, prior information or biases need to be paid attention to, 
since, as Janssen [16] points out, achieving reasonable power over more than 
a few orthogonal directions is hopeless. 

There has been another direction that we want to mention but do not 
develop in this paper. The idea is to construct a sequence of tests which 
are consistent against broader and broader classes of alternatives as one 
proceeds down the sequence, stopping testing at a data-determined point 
on the sequence. A limited proposal of this type was made by Rayner and 
Best [24] and developed more generally in [6]. Some important special cases 
are discussed in [13], Chapter 7, in the context of testing the hypothesis of 
no effect in nonparametric regression. 

Our general approach, which is detailed in Section 3, is to use as building 
blocks one-dimensional score (Rao) test statistics for simple hypotheses. 
For composite hypotheses we use the natural generalizations of Rao tests, 
efficient scores. These efficient score tests are called Neyman $C(\alpha)$ tests 
in the statistics literature or conditional moment restriction tests in the 
econometrics literature (see [2]). 

Conceptually, as we discuss in Section 3, our approach applies to general 
semiparametric hypotheses such as independence, the Cox model in sur- 
vival analysis and index models in econometrics. It also, as we demonstrate, 
guides us how to proceed when we test a parametric or semiparametric model 
within a semiparametric alternative, for instance, independence within cop- 
ula models, simple index versus multiple index models. We view this as the 
most important nontechnical contribution of the paper. 

In Section 3 we give some general conditions under which the asymp- 
totic theory for the types of test statistics discussed in this section, and 
for appropriate bootstrap critical values, is valid. We also study the power 
behavior of these tests under these assumptions. In Section 4 we discuss 
the classical examples of goodness of fit to a parametric hypothesis and in- 
dependence. We show how the classical tests of Kolmogorov-Smirnov and 
Cramer-von Mises type fit into our framework, and also derive a variety of 
new tests based on our principles. We indicate how the general conditions 
of Section 3 are implied by mild and easily checkable conditions in these 
classical situations. We have chosen to exhibit the approach in detail in two 
situations here, namely parametric hypotheses and independence. Tests of 
index models are covered in [7]. However, as we have indicated, our approach 
is applicable to any of the hundreds of goodness-of-fit problems that arise 
with semiparametric models. 

2. Heuristics. 
2.1. Parametric tests. Suppose that $X_1,\dots,X_n$ are an i.i.d. random 
sample from the probability $P \in \mathcal{Q}$, where $P \ll \mu$, $p = dP/d\mu$. Suppose that 
$\mathcal{Q} = \{P_\theta : \theta \in R\}$ is a regular (one-dimensional) submodel of probabilities. 
Consider testing the hypothesis $H: P = P_0$ against $K: P = P_\theta$ where $\theta > 0$. 

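Denote the log-likelihood of an observation and its derivative at $\theta = 0$, 
the efficient score (see [23]), by 

(1)  $\ell(\cdot;P) = \log p, \qquad \dot\ell(\cdot;P_0) = \left.\dfrac{\partial \ell(\cdot;P_\theta)}{\partial\theta}\right|_{\theta=0},$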

where $\dot\ell \in L_2(P_0)$, $E_0(\dot\ell(\cdot,P_0)) = 0$. The familiar scoring test of Rao [23] is 
based on the mean score test statistic, and is locally asymptotically most 
powerful. Since $n^{-1/2}\sum_{i=1}^n \dot\ell(X_i,P_0) \Rightarrow N(0, \|\dot\ell\|_0^2)$ under the null, it uses as 
asymptotic critical value $z_{1-\alpha}\|\dot\ell\|_0$, where $\|h\|_0^2 = \int h^2\,dP_0$ and $z_{1-\alpha}$ is the 
standard Gaussian $1-\alpha$ quantile. Note that this test is consistent if and 
only if $E_\theta(\dot\ell(X,P_0)) > 0$ for $\theta > 0$, namely if a nonzero $\theta$ implies a positive 
mean score. 
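
As a minimal sketch of the scoring test just described, the following Python function computes the normalized mean score and compares it with $z_{1-\alpha}\|\dot\ell\|_0$. The function name `rao_score_test`, its arguments, and the data are illustrative assumptions; the example takes the location model $N(\theta, 1)$, for which the score at $\theta = 0$ is $x$ and its $L_2(P_0)$ norm is 1.

```python
import math
from statistics import NormalDist


def rao_score_test(xs, score, score_norm, alpha=0.05):
    """One-sided score test of H: theta = 0 against K: theta > 0.

    `score(x)` is the score function at theta = 0 and `score_norm` is its
    L2(P_0) norm; both are supplied by the caller (hypothetical interface).
    """
    n = len(xs)
    # Normalized mean score: n^{-1/2} * sum_i score(X_i).
    stat = sum(score(x) for x in xs) / math.sqrt(n)
    # Asymptotic critical value z_{1-alpha} * ||score||_0.
    crit = NormalDist().inv_cdf(1.0 - alpha) * score_norm
    return stat, crit, stat > crit


# Illustrative data; model N(theta, 1), so score(x) = x and ||score||_0 = 1.
xs = [0.9, 1.4, -0.2, 0.7, 1.1, 0.3, 1.8, 0.5, 1.2]
stat, crit, reject = rao_score_test(xs, score=lambda x: x, score_norm=1.0)
```

The test rejects when `stat` exceeds `crit`; for other one-parameter models only `score` and `score_norm` change.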

All the nontechnical difficulties in testing are already in composite parametric 
hypotheses, although they are traditionally ignored. Suppose that 
the general family is defined as $\mathcal{P} = \{P_{(\eta,\theta)} : \theta \in R^q, \eta \in R^p\}$ so that the general 
log-density takes the form $\ell = \ell(\cdot, P_{(\eta,\theta)})$, which for simplicity we write 
$\ell(\eta,\theta)$. 

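Suppose first that $q = 1 < p$. The null hypothesis is the restricted family 
$\mathcal{P}_0 = \{P_{(\eta,0)} : \eta \in R^p\}$. The null set can be approximated locally by the 
tangent space associated with the null hypothesis, 

(2)  $\dot{\mathcal{P}}_0(P_0) = \mathrm{span}\Bigl\{\dfrac{\partial\ell}{\partial\eta_i}(\eta_0,0) : 1 \le i \le p\Bigr\}.$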

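The score function of the alternative is $\dot\ell = \partial\ell(\eta_0,0)/\partial\theta$. However, part of 
this score is in fact in the null space. Therefore, the efficient score is that 
part that is left by removing contributions from directions in the tangent 
space, 

$\ell^* = \dfrac{\partial\ell}{\partial\theta}(\eta_0,0) - \sum_{j=1}^p a_j(\eta_0,0)\,\dfrac{\partial\ell}{\partial\eta_j}(\eta_0,0),$

where the $a_j$'s are projection (least squares) weights; namely $\{a_j(\eta_0,0)\}$ minimize 
$\|\dot\ell - \sum_{j=1}^p a_j(\eta_0,0)\,\partial\ell(\eta_0,0)/\partial\eta_j\|_0^2$. If $q = 1$ and $\hat\eta$ is a $\sqrt{n}$-consistent 
estimator of $\eta$ under $H$, then the Neyman [21] $C(\alpha)$ test statistic is $T = n^{-1/2}\sum_{i=1}^n \ell^*(X_i)$, with $\ell^*$ evaluated at $(\hat\eta, 0)$. 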
When $q > 1 = p$, $\mathcal{P} = \{P_\theta : \theta \in R^q\}$, the departure can happen in different 
directions. The tangent space $\dot{\mathcal{P}}(P_0)$ is the linear closure of $\{\partial\ell/\partial\theta_j : j = 1,\dots,q\}$, 
and a standard test statistic is 

$T = \Bigl(n^{-1/2}\sum_{i=1}^n \dot\ell(X_i)\Bigr)^{T} I_0^{-1} \Bigl(n^{-1/2}\sum_{i=1}^n \dot\ell(X_i)\Bigr),$

where $I_0$ is the information matrix. The Rao tests, which are called Lagrange 
multiplier tests in econometrics, have the advantage of making use of esti- 
mates of the statistical model only under the null hypothesis. In contrast, 
Wald tests [27] and likelihood ratio tests are based on comparing estimates 
of the model under alternatives with those of the model estimated under 
the null. In the parametric context, to first order, both Wald and likelihood 
ratio tests are equivalent to score-based tests. 

It is important to note that T is just one way of combining the different 
test directions. There is nothing magic in the Mahalanobis distance. Suppose 
that we can rank the alternative one-dimensional models for which $h_1,\dots,h_q$ 
are score functions in order of plausibility. If they are orthogonal, it is plausible 
to use 

$T_a = \sum_{j=1}^q \lambda_j \Bigl(n^{-1/2}\sum_{i=1}^n h_j(X_i)\Bigr)^2,$

where $0 < \lambda_1 < \cdots < \lambda_p$ reflect the relative importance of the $h_j$. In general we would arrive at 



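(3)  $\Bigl[n^{-1/2}\sum_{i=1}^n h(X_i)\Bigr]^{T} \Lambda\,\Sigma_0^{-1}\,\Lambda\, \Bigl[n^{-1/2}\sum_{i=1}^n h(X_i)\Bigr],$

where $\Lambda = \mathrm{Diag}(\lambda_1,\dots,\lambda_p)$. 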

Another alternative is to use the union-intersection principle of Roy [25], 
to obtain 

(4)  $\max_{1 \le j \le p} \Bigl| n^{-1/2}\sum_{i=1}^n h_j(X_i) \Bigr|.$

In general, any norm of the vector $(h_1,\dots,h_q)$ could be used as a test statistic. 
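
To make the point about combining directions concrete, here is a small sketch that computes three combinations of the same normalized score sums: the quadratic (Mahalanobis) form, the weighted sum of squares, and the maximum. The function name `combine_scores` is hypothetical, and the empirical second-moment matrix of the scores is used in place of the information matrix, which is an assumption of this sketch.

```python
import numpy as np


def combine_scores(H, lam):
    """H[i, j] = h_j(X_i); lam[j] = importance weight of direction j."""
    H = np.asarray(H, dtype=float)
    lam = np.asarray(lam, dtype=float)
    n = H.shape[0]
    S = H.sum(axis=0) / np.sqrt(n)            # n^{-1/2} * sum_i h_j(X_i), one entry per j
    info = H.T @ H / n                        # empirical stand-in for the information matrix
    t_quad = float(S @ np.linalg.solve(info, S))   # Mahalanobis-type combination
    t_weighted = float(np.sum(lam * S ** 2))       # weighted sum of squared score sums
    t_max = float(np.max(np.abs(S)))               # union-intersection (max) combination
    return t_quad, t_weighted, t_max


# Tiny illustrative score matrix with two orthogonal directions.
H = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]])
t_quad, t_weighted, t_max = combine_scores(H, lam=[0.5, 2.0])
```

The three statistics weight the same evidence differently, which is why their power is concentrated on different alternatives.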



2.2. Semiparametric essentials. When the hypotheses are composite and 
semiparametric, the collection of score functions that generalize (1) and (2) 
depends on P, and no longer consists of Euclidean spaces. 

Let $\mathcal{Q} = \{P_\theta : \theta \in R\} \subset \mathcal{P}$ denote a regular one-parameter submodel with 
$P_0$ the true distribution. Clearly, the score in the model $\mathcal{Q}$, $h_{\mathcal{Q}} = \dot\ell$, depends 
on $\mathcal{Q}$. A composite hypothesis $\mathcal{P}$ is the union of many, usually an infinite 
number of, such $\mathcal{Q}$'s. The relevant set of (alternative) scores is the tangent 
space $\dot{\mathcal{P}}(P_0)$ defined as the linear closure of all the associated scores $h_{\mathcal{Q}}$; see, 
for instance, [3], Chapter 2, for details. 

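We parameterize the model by $\mathcal{P} = \{P_{(\alpha,\beta)} : \alpha \in A, \beta \in B\}$ where $A$, $B$ are 
subsets of function spaces, and the null hypothesis is $H: \beta = 0$. Define the 
"full," "null" and "alternative" tangent spaces (see [3], page 70): 

$\dot{\mathcal{P}}(\alpha,\beta)$ = tangent space of the model $\mathcal{P}$ at $P_{(\alpha,\beta)}$; 

$\dot{\mathcal{P}}_0(\alpha,0)$ = tangent space of $\{P_{(\alpha,0)} : \alpha \in A\}$ at $P_{(\alpha,0)}$; 

$\dot{\mathcal{P}}_0^{\perp}(\alpha,0)$ = orthogonal complement of $\dot{\mathcal{P}}_0(\alpha,0)$ in $\dot{\mathcal{P}}(\alpha,0)$. 

That is, $\dot{\mathcal{P}}_0^{\perp}(\alpha,0) \perp \dot{\mathcal{P}}_0(\alpha,0)$ and $\dot{\mathcal{P}}(\alpha,0) = \dot{\mathcal{P}}_0^{\perp}(\alpha,0) \oplus \dot{\mathcal{P}}_0(\alpha,0)$. 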

The space $\dot{\mathcal{P}}_0(P_0)$ captures the directions of variation from $P_0$ that are 
consistent with the null hypothesis of interest. To test $\theta = 0$, we should 
remove from any $\dot\ell \in \dot{\mathcal{P}}(P_0)$ its component that is actually consistent with 
the null hypothesis and is in $\dot{\mathcal{P}}_0(P_0)$. Therefore, the effective direction of 
interest for the alternative $\mathcal{Q}$ is given by the efficient score function 

$\ell^*(\cdot;P_0) = \dot\ell(\cdot;P_0) - \Pi(\dot\ell,\alpha), \qquad \dot\ell(\cdot;P_0) \in \dot{\mathcal{P}}(P_0),$

where $\Pi(\cdot,\alpha)$ is the projection operator from $L_2(P_\alpha)$ to the subspace $\dot{\mathcal{P}}_0(P_\alpha)$ 
of $L_2^0(P_\alpha)$, the space of square integrable functions with mean zero under 
$P_\alpha$. 

Notation. We write $Q(h)$ for $\int h\,dQ$, and $P_n$ for the empirical distribution 
function. 

3. Score tests. 

3.1. The score process and the testing paradigm. The above ideas moti- 
vate our general testing paradigm. 

Let $\Gamma$ be some index set, and let $h = h(\gamma,\alpha) \in L_2^0(P_\alpha)$, $\gamma \in \Gamma$, $\alpha \in A$, be 
some test function. Recall that $A$ is the parameter space under the null. The 
score process is 

$Z_n(\gamma,\alpha) \equiv n^{-1/2}\sum_{i=1}^n \Pi^{\perp}(h,\alpha)(X_i),$

where $\Pi^{\perp}$ is the projection from $L_2^0(P_\alpha)$ to $\dot{\mathcal{P}}_0^{\perp}(\alpha,0)$. 

$\Gamma$ is an index set, pointing to a direction $h(\gamma,\alpha)$ in the tangent space, or 
more generally in $L_2^0(P_0)$, such that $\{h_\gamma(\cdot,\alpha)\}$ is not too big, say a universal 
Donsker class. As we shall see in examples, the reason for making $h$ depend 
on $\alpha$ also is that it is natural to have the family of scores depend on where 
we think we are in $A$. To avoid technicalities, we assume that for all $x$ and $\alpha$, 
$h_{\cdot}(x,\alpha) \in \ell_\infty(\Gamma)$, the space of all bounded real-valued functions on $\Gamma$. We 
may write $h_\gamma(\alpha)$, suppressing dependence on $x$. 
In general, $Z_n(\gamma,\alpha)$ is not computable given the data, but if $\hat\alpha \in A$ is an 
estimate of $\alpha$ we can consider 

$\hat Z_n(\gamma) = Z_n(\gamma,\hat\alpha)$

defined on $\Gamma$. If $\hat\alpha$ is an MLE, $\hat Z_n$ simplifies as in the parametric case to 

$\tilde Z_n(\gamma) = n^{1/2}(P_n - P_{\hat\alpha})(h_\gamma(\cdot,\hat\alpha)),$

since $P_n(v) = 0$ for all $v \in \dot{\mathcal{P}}_0(\hat\alpha)$ is a restatement of the likelihood equations. 
In particular, $P_n(\Pi(h,\hat\alpha)) = 0$. We will also consider $\tilde Z_n$ more generally for 
$\hat\alpha$ an efficient estimate in the sense of [3], Chapter 5, pages 179-182. 

We think of $\hat Z_n(\cdot)$, $\tilde Z_n(\cdot)$, and so on as stochastic processes defined on $\Gamma$ 
related to empirical processes; see [26], for instance. We shall use $\hat Z_n$ and 
$\tilde Z_n$ to construct tailor-made tests. 

Let $A_0 \subset A$ be a neighborhood of the true $\alpha_0$ where $A$ is a metric space 
with metric $\rho$. As above, we write $P_0$ for $P_{\alpha_0}$, and so on. We always require 

(5)  $Z_n(\cdot,\alpha_0) \Rightarrow Z(\cdot,\alpha_0)$

under $P_{\alpha_0}$, for all $\alpha_0 \in A_0$, in the sense of weak convergence for $\ell^\infty(\Gamma)$-valued 
variables, where $Z(\cdot,\alpha_0)$ is a mean-zero Gaussian process with 

$\mathrm{cov}(Z(\gamma_1,\alpha_0), Z(\gamma_2,\alpha_0)) = \mathrm{cov}_0(h_1,h_2) - \mathrm{cov}_0(\Pi(h_1,\alpha_0),\Pi(h_2,\alpha_0))$

with the obvious convention in notation. (To be exact, we should be speaking 
of outer probabilities since we interpret weak convergence in the sense of 
Hoffmann-Jørgensen. But measurability issues can be dealt with easily in 
the situations we are interested in and we ignore them in the future.) The 
property we want is 

(6)  $\hat Z_n(\cdot)$ or $\tilde Z_n(\cdot) \Rightarrow Z(\cdot,\alpha_0)$

in the same sense as above. 

We propose to base our tests on the score process, at least conceptually. 
What (6) will give us is the weak convergence of statistics of the form 
$T(\hat Z_n(\cdot))$ where $T: \ell_\infty(\Gamma) \to R$ continuously. Possibilities are 

(7)  $T_n = \int \hat Z_n^2(\gamma)\,d\mu(\gamma)$

for $\mu$ a finite measure on $\Gamma$, or 

(8)  $T_n = \sup_{\Gamma} |\hat Z_n(\gamma)|$

or more general $\mu$ norms of $|\hat Z_n|$, or even $\alpha$-dependent $\mu$'s which are suitably 
continuous in $\alpha$. By taking the span $\{h(\gamma,\alpha) : \gamma \in \Gamma\}$ dense in $\dot{\mathcal{P}}_0^{\perp}(\alpha,0)$ for 
all $\alpha$, and $\mu$ with support $\Gamma$, we can expect consistency against all alternatives. 
We will illustrate further in the examples of the next section. A simple 
example is the following. 

Example 3.1 (Goodness-of-fit statistics). Consider testing the null hypothesis 
that a distribution on $R$ is $P_0$ against "all" alternatives, namely 
where $\dot{\mathcal{P}}(P_0) = L_2^0(P_0)$. We consider the family of directions $h_\gamma(\cdot) = 1(\cdot \le \gamma) - F_0(\gamma)$, 
$\gamma \in R$, where $F_0$ is the cumulative distribution function of $P_0$. 
The following two statistics arise in association with those above. Associated 
with (3) is the familiar Cramér-von Mises (CvM) goodness-of-fit 
statistic, $T_a = n\int (F_n(\gamma) - F_0(\gamma))^2\,dF_0(\gamma)$, where $F_n$ is the empirical distribution 
function, and the weighting measure is $\mu = F_0$. Corresponding to 
(4) is the familiar Kolmogorov-Smirnov (KS) goodness-of-fit test statistic, 
$T_b = \sup_\gamma |\sqrt{n}(F_n(\gamma) - F_0(\gamma))|$. Note that these $h_\gamma$ here are typically chosen 
because we start with the cumulative distribution function as our representation 
of the probability $P$, not because of a desire for power in clearly 
defined directions (which is the result). 
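
The two statistics of Example 3.1 can be computed exactly from the order statistics. The sketch below uses the standard closed forms for the supremum and for the integral with respect to $F_0$; the function name `ks_and_cvm` and the toy data are illustrative assumptions, and `cdf0` stands for the hypothesized continuous $F_0$.

```python
import math


def ks_and_cvm(xs, cdf0):
    """Return (T_b, T_a) = (KS, CvM) statistics for the simple null F_0 = cdf0."""
    u = sorted(cdf0(x) for x in xs)           # F_0 at the order statistics
    n = len(u)
    # sup_gamma |F_n - F_0| is attained at a jump point of F_n.
    d_plus = max((i + 1) / n - ui for i, ui in enumerate(u))
    d_minus = max(ui - i / n for i, ui in enumerate(u))
    ks = math.sqrt(n) * max(d_plus, d_minus)
    # n * integral (F_n - F_0)^2 dF_0 in closed form.
    cvm = 1.0 / (12 * n) + sum(
        (ui - (2 * i + 1) / (2 * n)) ** 2 for i, ui in enumerate(u)
    )
    return ks, cvm


# Toy example: uniform(0, 1) null, so cdf0 is the identity on [0, 1].
ks, cvm = ks_and_cvm([0.1, 0.4, 0.7], cdf0=lambda x: x)
```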

3.2. General theorems. We close this section with some general theorems 
giving essentially the minimal conditions under which our heuristics for test 
statistic construction and critical value setting are justified. Checking the 
conditions of these theorems is the major difficulty. 

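Here are the conditions we use for our theorems. The estimate $\hat\alpha$ is such 
that for $P = P_0 = P_{\alpha_0}$: 

(M0) $\{h_\gamma(\cdot,\alpha_0) - \Pi(h_\gamma(\cdot,\alpha_0),\alpha_0) : \gamma \in \Gamma\} = \{\Pi^{\perp}(h_\gamma(\cdot,\alpha_0)) : \gamma \in \Gamma\}$ is a 
universal Donsker class. 
(M1) $\|(P_n - P_0)(\Pi(h_\gamma(\cdot,\hat\alpha),\hat\alpha) - \Pi(h_\gamma(\cdot,\hat\alpha),\alpha_0))\|_\infty = o_P(n^{-1/2})$. 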
(M2) $\sup\{\|(P_{\hat\alpha} - P_0)h_\gamma(\cdot,\alpha) + P_0\Pi(h_\gamma(\cdot,\alpha),\hat\alpha)\|_\infty : \alpha \in A\} = o_P(n^{-1/2})$. 
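(M3) $\|(P_n - P_0)(h_\gamma(\cdot,\hat\alpha) - h_\gamma(\cdot,\alpha_0))\|_\infty = o_P(n^{-1/2})$. 
(M4) $\|(P_{\hat\alpha} - P_0)(h_\gamma(\cdot,\hat\alpha) - h_\gamma(\cdot,\alpha_0))\|_\infty = o_P(n^{-1/2})$. 
(M5) $\|(P_{\hat\alpha} - P_0)h_\gamma(\cdot,\alpha_0) - (P_n - P_0)\Pi(h_\gamma(\cdot,\alpha_0),\alpha_0)\|_\infty = o_P(n^{-1/2})$. 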

Notes. 1. (M3) and (M4) are automatically satisfied if $h_\gamma(\cdot,\alpha)$ does not 
depend on $\alpha$. 

2. If $\mathcal{H}_0 = \{h_\gamma(\cdot,\alpha_0) - P_{\alpha_0}h_\gamma(\cdot,\alpha_0) : \gamma \in \Gamma\}$ is a Donsker class, showing that 
$\{\Pi(h,\alpha_0) : h \in \mathcal{H}_0\}$ is also Donsker is usually not hard. For instance, suppose 
$\Pi$ preserves order, $\Pi(h_1,\alpha_0) \le \Pi(h_2,\alpha_0)$ if $h_1 \le h_2$. If $\mathcal{H} = \{h_\gamma(\cdot,\alpha_0) : \gamma \in \Gamma\}$ 
satisfies the bracketing entropy condition of Theorem 2.8.4, page 172, of [26], 
then since $\Pi(\cdot,\alpha_0)$ is $L_2(P_0)$ norm reducing, $\{\Pi(h,\alpha_0) : h \in \mathcal{H}_0\}$ also satisfies 
the same condition. Thus (M0) is usually not difficult. 

3. (M5) says that $\hat\alpha$ is efficient under $H$, a generalization of the requirement 
that $\hat\alpha$ be (a regularly behaving) MLE; see [3], pages 176-182. 
Here are two theorems. 

Theorem 3.1. If (M0), (M3), (M4) and (M5) hold, then for all $\alpha_0$, 

(9)  $Z_n(\cdot,\alpha_0) \Rightarrow Z(\cdot,\alpha_0)$, 

(10)  $\tilde Z_n(\cdot) = Z_n(\cdot,\alpha_0) + o_{P_0}(1)$, 
and hence, 

(11)  $\tilde Z_n(\cdot) \Rightarrow Z(\cdot,\alpha_0)$. 

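Proof. By construction and (M3), 

$\tilde Z_n(\gamma) = n^{1/2}\{(P_n - P_0)(h_\gamma(\cdot,\hat\alpha)) - (P_{\hat\alpha} - P_0)(h_\gamma(\cdot,\hat\alpha))\}$

$\quad = n^{1/2}\{(P_n - P_0)(h_\gamma(\cdot,\alpha_0)) - (P_{\hat\alpha} - P_0)(h_\gamma(\cdot,\hat\alpha))\} + o_P(1)$

$\quad = n^{1/2}\{(P_n - P_0)(h_\gamma(\cdot,\alpha_0)) - (P_{\hat\alpha} - P_0)(h_\gamma(\cdot,\alpha_0))\} + o_P(1)$

by (M4). Finally by (M5), 

$\tilde Z_n(\gamma) = n^{1/2}\{(P_n - P_0)(h_\gamma(\cdot,\alpha_0) - \Pi(h_\gamma(\cdot,\alpha_0),\alpha_0))\} + o_P(1),$

which is just (10). Note that $o_P(1)$ is interpreted here in the sense of $\|\cdot\|_\infty$ 
on functions of $\gamma$. Conclusions (9) and (11) follow immediately from (M0) 
and (10). □ 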

Theorem 3.2. If (M0)-(M3) hold, then 

$\hat Z_n(\gamma) = Z_n(\gamma) + o_{P_0}(1).$

The proof appears in the Appendix. 

Condition (M2) is implied by the following two more easily checkable 
conditions. See Lemma A.1. 

(N1) (i) $\hat\alpha$ is consistent in the Hellinger metric $\rho_H$ given by $\rho_H^2(\alpha_1,\alpha_2) = 
\int(\sqrt{dP_{\alpha_1}} - \sqrt{dP_{\alpha_2}})^2$. 
(ii) Let $A_0$ be a fixed Hellinger ball around $\alpha_0$ and suppose that 
$\mu \gg P_\alpha$, $\alpha \in A_0$, and with $\mu$ a probability measure. Let $\|\cdot\|_\mu$ be 
the $L_2(\mu)$ norm. Write $s(\alpha) = \sqrt{dP_\alpha/d\mu}$, $\hat s = s(\hat\alpha)$, $s_0 = s(\alpha_0)$, and 
assume 

$\|(\hat s - s_0)^2/\hat s\|_\mu^2 = \int (s_0/\hat s - 1)^4\,\hat s^2\,d\mu = o_P(n^{-1}).$

(iii) Let $\Pi_\mu$ denote projection in $L_2(\mu)$ onto the tangent space at $s(\alpha_0)$ 
of $\mathcal{S} = \{s(\alpha) : \alpha \in A\}$. Assume $\|\hat s - s_0 - \Pi_\mu(\hat s - s_0)\|_\mu = o_P(n^{-1/2})$. 






P. J. BICKEL, Y. RITOV AND T. M. STOKER 



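(N2) (i) $\sup\{\|h_\gamma(\cdot,\alpha)\|_\infty : \gamma \in \Gamma, \alpha \in A_0\} < \infty$. 

(ii) $\sup\{\|\Pi(h_\gamma(\cdot,\alpha))\|_\infty : \gamma \in \Gamma, \alpha \in A_0\} < \infty$. 

Note. In smooth parametric models (N1)(iii) holds if $\|\hat s - s_0\|_\mu = 
o_P(n^{-1/4})$, $\|\hat s - s_0 - \Pi_\mu(\hat s - s_0)\|_\mu = O(\|\hat s - s_0\|_\mu^2)$. 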

We may wish to consider (see below) statistics in which the averaging 
measure also depends on $\alpha$, say $\int \hat Z_n^2(h_\gamma)\,d\mu(\gamma,\hat\alpha)$. This too can be dealt 
with by a condition such as: $\alpha \to \mu(\cdot,\alpha)$ is uniformly continuous on $A$ in 
the bounded variation topology on the finite signed measures on $\Gamma$. More 
generally, we may simply consider any test statistic of the form $F(\hat Z_n,\hat\alpha)$, 
where $F: \ell_\infty(\Gamma) \times A \to R$ is continuous in the $\ell_\infty \times \rho$ topology. 

3.3. Setting critical values: the bootstrap. There is a novel issue that arises 
in the context of composite hypotheses. The statistic $\hat Z_n = n^{-1/2}\sum_{i=1}^n \ell^*(X_i,P_{\hat\alpha})$ 
arising from the situation where there is only one direction of departure is, if 
$P_\alpha$ is true, an approximation to $Z_n(\alpha) = n^{-1/2}\sum_{i=1}^n \ell^*(X_i,P_\alpha)$. Since $Z_n(\alpha)$ has 
an $N(0, I(P_\alpha))$ limiting distribution, a Gaussian critical value using $I(P_{\hat\alpha})$ 
is appropriate. On the other hand, if the null hypothesis is simple, critical 
values for any statistic can be obtained by simulation. But in the general 
situation of composite hypotheses, that we now consider, unless there is invariance, 
the most plausible way of setting critical values is by a bootstrap. 
The natural choice is to simulate from $P_{\hat\alpha}$. That is, let $S(\gamma,\alpha) = \Pi^{\perp}(h,\alpha)$, 
where $h = h(\gamma,\alpha)$, and let 

$\check Z_n(\gamma) = n^{-1/2}\sum_{i=1}^n S(\gamma,\hat\alpha)(\check X_i),$

where the $\check X_i$'s are i.i.d. from $P_{\hat\alpha}$. We expect that if (6) holds, $\check Z_n \Rightarrow Z(\cdot,\alpha_0)$ 
in $P_0$ probability. That is, the Prohorov distance between the $P_{\hat\alpha}$ distribution 
of $\check Z_n$ and the distribution of $Z(\cdot,\alpha)$ tends to 0 in $P_\alpha$ probability. 

To ensure that this bootstrap method works for both $\tilde Z_n$ and $\hat Z_n$ we need 
to simply replace conditions (M0)-(M5) by versions uniform in $\alpha_0 \in A_0$. 
We leave a formal statement to the reader. Alternatively, in these cases, as 
has been explored in [2] and [4], it is also possible to use the $m$ out of $n$ 
bootstrap, simulating the distribution of the statistic for samples of size $m$ 
by drawing subsamples of size $m$ from the original sample, where $m \to \infty$, 
$m/n \to 0$. 
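
The simulation from the fitted null model can be sketched as follows for a Kolmogorov-Smirnov type statistic with a Gaussian null and estimated mean and variance. This is an illustrative sketch, not the procedure analyzed in the theorems: the function names are hypothetical, and the parameters are re-estimated on every simulated sample, which is a common way of mimicking the estimated process and is an assumption here.

```python
import math
import numpy as np


def norm_cdf(z):
    """Standard Gaussian distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ks_estimated(xs):
    """sqrt(n) * sup_x |F_n(x) - Phi((x - mean) / sd)| with MLEs plugged in."""
    x = np.sort(np.asarray(xs, dtype=float))
    n = len(x)
    mu, sigma = x.mean(), x.std()             # Gaussian MLEs (ddof = 0)
    u = np.array([norm_cdf((v - mu) / sigma) for v in x])
    i = np.arange(1, n + 1)
    d = max(np.max(i / n - u), np.max(u - (i - 1) / n))
    return math.sqrt(n) * d


def parametric_bootstrap_pvalue(xs, n_boot=500, seed=0):
    """Simulate from the fitted Gaussian and recompute the statistic each time."""
    rng = np.random.default_rng(seed)
    x = np.asarray(xs, dtype=float)
    n = len(x)
    mu, sigma = x.mean(), x.std()
    t_obs = ks_estimated(x)
    t_star = np.array(
        [ks_estimated(rng.normal(mu, sigma, size=n)) for _ in range(n_boot)]
    )
    # Bootstrap p-value: share of simulated statistics at least as large as observed.
    return t_obs, float(np.mean(t_star >= t_obs))


sample = np.array([0.3, -1.2, 0.8, 2.1, -0.4, 1.5, -0.9, 0.1])
t_obs, p_value = parametric_bootstrap_pvalue(sample, n_boot=200, seed=1)
```

Because the statistic is location and scale invariant, simulating from the standard Gaussian instead of the fitted one would give the same null distribution in this particular model.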

There is another way of bootstrapping discussed in [5] which may be 
simpler since it only involves resampling. Let 

$Z_n^*(\gamma) = n^{-1/2}\sum_{i=1}^n \bigl(S(\gamma,\hat\alpha^*)(X_i^*) - S(\gamma,\hat\alpha)(X_i)\bigr),$

where $\hat\alpha^* = \hat\alpha(X_1^*,\dots,X_n^*)$ and $X_1^*,\dots,X_n^*$ are i.i.d. from the empirical distribution 
$P_n = n^{-1}\sum_{i=1}^n \delta_{X_i}$. The appropriate heuristic is that again if (6) 
holds, $Z_n^* \Rightarrow Z(\cdot,\alpha_0)$ in $P_0$ probability. 

This bootstrapping method is more problematic to check. Essentially what 
is needed for $Z_n^*$ to obey (3.5) are conditions given in [5]. For these we make 
the following identification: Suppose $\hat\alpha = \alpha(P_n)$ where $\alpha: \mathcal{M} \to A$, $\alpha(P_\alpha) = \alpha$ 
for $P_\alpha \in \mathcal{P}$ and $\mathcal{M}$ is the set of all probabilities. Let $T: \mathcal{M} \to L(\mathcal{H}) \times L(\mathcal{H})$ 
be defined for $\mathcal{H}$ a Banach space of functions containing $\{h_\gamma : \gamma \in \Gamma\}$ and 
$L(\mathcal{H})$ the set of bounded linear functionals on $\mathcal{H}$, by 

$T(P)(h) = \Bigl(\int \Pi(h,\alpha(P))\,dP,\ \int h\,d(P - P_{\alpha(P)})\Bigr).$

Note that $T(P_\alpha) = 0$ so that the hypothesis is contained in $\{P : T(P) = 0\}$. 
Now put on $T(P)$ the conditions specified by Bickel and Ren [5]. 

3.4. Power. It is easiest to see what happens to the processes on which 
we build our tests in the case where alternatives converge to $P_0$ in the $n^{-1/2}$ 
scale. Specifically suppose $\{P_t : |t| < 1\}$ is a one-dimensional regular parametric 
model through $P_0$ with score function $g(\cdot)$ such that $g \notin \dot{\mathcal{P}}_0(P_0)$, that 
is, it is possible to discriminate $\{P_t, t \ne 0\}$ from $\mathcal{P}_0$ at the $n^{-1/2}$ scale. Then, 
let $Z_g(h_\gamma,0)$ be the Gaussian process with the same covariance structure as 
$Z(h_\gamma,0)$ but with 

$E Z_g(h_\gamma,0) = \int (h_\gamma - \Pi(h_\gamma,\alpha_0))\,g\,dP_0.$

Evidently, $Z_g(h_\gamma,0) = Z(h_\gamma,0)$, if $g$ does not have a component orthogonal 
to $\dot{\mathcal{P}}_0$. Define $Z_g(h_\gamma,\alpha)$ similarly for $g \in L_2^0(P_\alpha)$, $g \notin \dot{\mathcal{P}}_0(P_\alpha)$. The following 
result is an immediate consequence of Le Cam's LAN theory; see, for 
example, his "third lemma" [12] and our theorems. 

Theorem 3.3. Suppose $g \in \dot{\mathcal{P}}_0^{\perp}(P_\alpha)$ is, for each $\alpha$, the score function of 
a regular model through $P_\alpha$. Assume the sufficient conditions of Theorems 
3.1 or 3.2 hold. Then $\hat Z_n(\cdot) \Rightarrow Z_g(\cdot,\alpha)$, where $\Rightarrow$ is weak convergence under 
$P_{g,t_n}$, where $\{P_{g,t} : |t| < 1\}$ is a regular model passing through $P_0 = P_\alpha$, with 
score function $g$ at 0, and $t_n = t n^{-1/2}$ for fixed $t$. 

Suppose $q$ is bowl-shaped and symmetric and its discontinuity set has 
probability 0. That is, if $C = \{z : q \text{ is continuous at } z\}$, $P[Z(\cdot,\alpha) \text{ or } Z_g(\cdot,\alpha) \notin C] = 0$, 
and $q: \ell_\infty(\Gamma) \to R$, $q(z) = q(-z)$, $q(\lambda z)$ strictly increasing in $\lambda$ for 
$\lambda > 0$ and all $z$. Then, if, as we assume, $Z(\cdot,\alpha)$ is tight, we have 

(12)  $E(q(Z(\cdot,\alpha))) \le E(q(Z_g(\cdot,\alpha))).$
Equation (12) follows from Anderson's theorem if $\{G_1,\dots,G_k\}$ forms a partition 
of $\Gamma$, $Z(\gamma,\alpha)$ is replaced by $Z^{(k)}(\gamma,\alpha) = \sum_{j=1}^k Z(\gamma_j,\alpha)\,1(\gamma \in G_j)$ and 
$Z_g$ is similarly approximated. Now $\|Z_g(\cdot,\alpha) - Z_g^{(k)}(\cdot,\alpha)\|_\infty \to 0$ as $k \to \infty$, 
for all $g$ including $g = 0$, and (12) follows in general. 
It holds that test statistics of the form (7)-(8) have 

$\liminf_n P_{g,t_n}[T_n > c] > \liminf_n P_0[T_n > c]$

for $t_n = \lambda n^{-1/2}$, all $\lambda > 0$, $c$ as desired, and for all $g$ such that $E Z_g(h_\gamma,0) \ne 0$ 
for some $\gamma$. 

3.5. Consistency. Consistency against fixed $P \notin \mathcal{P}_0$ can be obtained by 
a strengthening of conditions, though the strengthening we now give is 
overkill. 

Suppose that $P$ is as above. For a suitable $\alpha(P) \in A$ call (M$_P$j), $j = 
0,\dots,5$, condition (Mj) with $P_0$ replaced by $P$ and $\alpha_0$ replaced by $\alpha(P)$. 
Define the process $Z_P(\cdot)$ as the Gaussian process with mean zero and the 
covariance structure given in (5) with $\Pi(\cdot,\alpha_0)$ replaced by $\Pi(\cdot,\alpha(P))$ and 
$P_0$ replaced by $P$. Then the conclusion of Theorems 3.1 and 3.2 holds if $P_0$ 
is replaced by $P$ with: 

(i) $\hat Z_n(h_\gamma)$ replaced by $\hat Z_n(h_\gamma) - \sqrt{n}\int h_\gamma(\cdot,\alpha(P))\,dP$; 

(ii) $Z(\cdot,\alpha_0)$ replaced by $Z_P(\cdot)$. 

We conclude that $|\hat Z_n(h_\gamma)| \to \infty$ in probability if $\int h_\gamma(\cdot,\alpha(P))\,dP \ne 0$. Thus consistency 
holds for $T_n$ given by (7) if $\int h_\gamma(\cdot,\alpha(P))\,dP \ne 0$ for some $\gamma$ and all $P \notin \mathcal{P}_0$. 
Consistency for other statistics can be reasoned analogously. 

4. Examples. In this section we consider a few important examples in 
which we show how our notions produce tests which have appeared in the 
literature and some new ones. Our point is to illustrate the ideas of the score 
process and the tailor-made tests. 

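4.1. Testing goodness of fit to a composite parametric hypothesis. Let 
$\{P_\theta : \theta \in \Theta \subset R^d\}$ be a regular parametric model and let $\hat\theta_n$ be a regular 
([3], pages 18-19) estimate of $\theta$ under $H$. We test $H$ against a saturated 
model such that $\dot{\mathcal{P}}(P) = \{a \in L_2(P) : E a(X_1) = 0\}$: 

$\dot{\mathcal{P}}_0(P_\theta) = \mathrm{span}\Bigl\{\dfrac{\partial\ell}{\partial\theta_1}(X_1,\theta),\dots,\dfrac{\partial\ell}{\partial\theta_d}(X_1,\theta)\Bigr\},$

where $\ell = \log p(x,\theta)$ is the log-likelihood. Then 

$\Pi(a,\theta) = \sum_{j=1}^d c_j(a,\theta)\,\dfrac{\partial\ell}{\partial\theta_j}(X_1,\theta),$

and $c_j(a,\theta)$ is the coefficient of the projection of $a$ on $\dot{\mathcal{P}}_0(P_\theta)$, defined by 
minimizing 

$E_\theta\Bigl(a(X_1) - \sum_{j=1}^d c_j\,\dfrac{\partial\ell}{\partial\theta_j}(X_1,\theta)\Bigr)^2.$

Identifying $\alpha$ with $\theta$, $h = h_\gamma(\cdot,\theta)$, we obtain 

$S(\gamma,\theta) = h - E_\theta h(X_1) - \sum_{j=1}^d c_j(h,\theta)\,\dfrac{\partial\ell}{\partial\theta_j}(X_1,\theta).$

The corresponding estimated score process is, for an estimate $\hat\theta$, given by 

$\hat Z_n(h) = n^{-1/2}\sum_{i=1}^n \Bigl\{h(X_i) - E_{\hat\theta}h(X_1) - \sum_{j=1}^d c_j(h,\hat\theta)\,\dfrac{\partial\ell}{\partial\theta_j}(X_i,\hat\theta)\Bigr\}.$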

Suppose $\mathcal{P}_0$ is regular parametric, and more: 
(R1) $\hat\theta$ is regular on $\Theta$. 

(R2) Suppose $\Gamma$ is compact $\subset \bar{R}^p$, where $\bar{R} = [-\infty,\infty]$, the processes $(\gamma,\theta) \to 
n^{1/2}(P_n - P_0)h_\gamma(\cdot,\theta)$ are tight and $\sup_{x,\gamma,\theta}|h_\gamma(x,\theta)| < \infty$. 

(R3) The map $\theta \to h_{\cdot}(\cdot,\theta)$ is continuous in the norm on functions of $(\gamma,x)$ 
given by $\|w\|^2 = \sup_\gamma \int w^2(x,\gamma)\,dP_0(x)$. 

The following proposition is a consequence of Theorem 3.2. We check its 
conditions using Lemma A.1. (M0) and (M1) hold and (N1)(i) and (N2) are 
immediate. We can check (M2) via (N1) and (N2). Condition (N1)(iii) follows 
since $\Pi_\mu(s(\hat\theta) - s(\theta_0)) = \dot s(\theta_0)(\hat\theta - \theta_0) + o_P(\hat\theta - \theta_0)$. Condition (N1)(ii) 
requires further conditions. For instance, it follows if the likelihood ratio 
$s(\cdot,\theta)/s(\cdot,\theta')$ is uniformly bounded for $\theta,\theta'$ within $\varepsilon$ of $\theta_0$, a case which 
unfortunately excludes the Gaussian, but the condition can be checked directly 
fairly easily for suitable $\hat\theta$. We have established 

Proposition 4.1. If (R1)-(R3) and (N1)(ii) hold for a regular parametric 
hypothesis and given estimate $\hat\alpha$, then $\hat Z_n(\cdot) \Rightarrow Z(\cdot)$ where 

$\mathrm{cov}(Z(h_1),Z(h_2)) = \mathrm{cov}_{\theta_0}(h_1(X_1),h_2(X_1)) - c^T(h_1,\theta_0)\,I(\theta_0)\,c(h_2,\theta_0),$

where $c = (c_1,\dots,c_d)^T$ and $I$ is the Fisher information. 

Versions of a result such as this one appear in [9] and [19] when the $h_\gamma$ 
are indicators of half lines. 

The corresponding result for the bootstrap process $\check Z_n$ is also valid as is 
that for $Z_n^*$. That is, both the parametric and the Bickel-Ren application of 
the nonparametric bootstrap to testing can be used to set critical values. 

Suppose $h$ does not depend on $\alpha$ and we weaken (R1) to 

(R1)' $\hat\theta = \theta + o_{P_\theta}(n^{-1/4})$ for all $\theta \in \Theta$. 

The theorem will still hold provided that we have Cramér conditions on 
several derivatives of the likelihood ensuring that the remainders in (N1) 
are quadratic in $\hat\theta - \theta_0$. Note that Proposition 4.1 enables us to plug in 
subefficient estimates without affecting the properties of our tests. 

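Proposition 4.2. If the hypothesis is regular parametric, (R1), (R2) 
and (R3) hold and $\hat\theta$ is efficient in the sense of [3], page 43, then the conclusion 
of Theorem 3.1 is valid. 

Proof. We need only check (M5) and (M4). The former follows from 

$(P_{\hat\theta} - P_0)h_\gamma(\cdot,\theta_0) = (\hat\theta - \theta_0)^T \int h_\gamma(x,\theta_0)\,\dfrac{\partial\ell}{\partial\theta}(x,\theta_0)\,dP_0(x) + o_P(|\hat\theta - \theta_0|),$

and this requires in view of (R2) only that $\theta \to p(\cdot,\theta)$ is $L_1$ differentiable, 
which is a consequence of regularity. The latter follows from regularity of $\mathcal{P}_0$ 
and (R1) and (R3). □ 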

Again with $\hat\theta$ the MLE, results such as this one appear in [9] and [19]. 
The uniformity required for both versions of the bootstrap can easily be 
imposed. 

4.2. The Gaussian model. We specialize to one of the most important 
parametric hypotheses $\mathcal{P}_0 = \{N(\mu,\sigma^2) : \mu \in R, \sigma^2 > 0\}$. Here we naturally 
take $\hat\theta = (\bar X, \hat\sigma^2)$, the MLE's. It is convenient to use the invariance properties 
of the hypothesis and take $h(\cdot,\gamma,\theta) = h_\gamma\bigl((\cdot - \mu)/\sigma\bigr)$ if $\theta = (\mu,\sigma)$. With this 
choice we are considering 



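If $\{h_\gamma = 1(-\infty,\gamma], \gamma \in R\} = \mathcal{H}_0$, then satisfaction of (R1)-(R3) is easy. Using 
$\tilde Z_n$ as above we arrive at the common test statistics of Kac, Kiefer and 
Wolfowitz [17]: the Kolmogorov-Smirnov type, 

$\sup_\gamma |\tilde Z_n(\gamma)| = \sup_x \sqrt{n}\,\Bigl|F_n(x) - \Phi\Bigl(\dfrac{x - \bar X}{\hat\sigma}\Bigr)\Bigr|,$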
and the Cramér-von Mises type, 

$\int_{-\infty}^{\infty} \tilde Z_n^2(h_x)\,d\Phi\Bigl(\dfrac{x - \bar X}{\hat\sigma}\Bigr) = n\int_{-\infty}^{\infty} \Bigl(F_n(x) - \Phi\Bigl(\dfrac{x - \bar X}{\hat\sigma}\Bigr)\Bigr)^2 d\Phi\Bigl(\dfrac{x - \bar X}{\hat\sigma}\Bigr),$

where $F_n$ is the empirical d.f. $\mathcal{H}_0$ is a well-known universal Donsker class 
and the classical limiting result for Cramér-von Mises tests given in [17] 
follows. Tests can be implemented for both statistics using either of the two 
bootstraps. Invariance here implies that only simulation under $N(0,1)$ is 
required. Other classes of tests are covered, for example, tests based on the 
empirical characteristic function [11]. 

We can also tailor statistics more carefully. For example, we can consider 
the two Gaussian mixture models as the alternative: 

(l- £ )$(^)+ £ g>( *-^- A ), p,AeR,a>0,0<e<± 

At least formally the tangent set V{Pq) (see [3], page 50) at e = 0, = (//, a) 
is just the set span{A"i - p, (Xi - p) 2 } U {(exp{^(Xi - /J,)- ^} - 1) : A G 

R}. We are led to consider Hq = {exp{Ax — 4-} : A G R} and statistics such 
as 



! SUp 
A 



1 " 

L y"( e A((Xi-x)/*)-A 2 /2 

n ~ 



Unfortunately, $\mathcal{H}_0$ is not a Donsker class and $T_n \to \infty$ under $H$; see [1].
Our heuristics and Theorem 3.1 apply if we restrict $\Delta$ to a compact set. The
power against $n^{-1/2}$ alternatives of such $T_n$ persists. Note that $T_n$ can be
viewed as a diagnostic since the maximizing value of $\Delta$ indicates where a
second component might be.
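
A minimal sketch of this tailored statistic, with the shift parameter
restricted to a compact set, might look as follows. The grid of shift values,
the function names and the simulation settings are assumptions of the example
rather than choices made in the text; the critical value is obtained by
simulation under $\mathcal{N}(0,1)$, which suffices because the statistic depends on
the data only through the standardized observations.

```python
import numpy as np


def mixture_sup_stat(x, deltas):
    """Sup over a finite grid of the exponential-score process.

    Returns the statistic and the maximizing shift, which serves as a rough
    diagnostic for the location of a possible second component.
    """
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()          # standardize with the MLE's
    deltas = np.asarray(deltas, dtype=float)
    # Rows index shift values, columns index observations.
    scores = np.exp(np.outer(deltas, z) - deltas[:, None] ** 2 / 2.0) - 1.0
    zn = scores.sum(axis=1) / np.sqrt(z.size)
    k = int(np.argmax(np.abs(zn)))
    return float(np.abs(zn[k])), float(deltas[k])


def null_critical_value(n, deltas, level=0.05, n_sim=1000, seed=0):
    """Upper-level quantile of the statistic simulated under N(0, 1)."""
    rng = np.random.default_rng(seed)
    sims = [mixture_sup_stat(rng.standard_normal(n), deltas)[0]
            for _ in range(n_sim)]
    return float(np.quantile(sims, 1.0 - level))
```

A call such as `mixture_sup_stat(x, np.linspace(-3, 3, 61))` returns both the
statistic and the maximizing shift on the standardized scale; the grid
endpoints play the role of the compact set.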

We can also consider versions of the Cramer-von Mises approach reflecting
our goals more precisely. For instance, consider a wavelet basis for $[0,1]$
written lexicographically $w_{ij}$ in order of scale and then within scale with
$w_{11} \equiv 1$. Then given that we care more for departures at lower scales, consider
$\lambda_{ij} = \lambda_i \equiv \rho^i$, $1 \le j \le 2^i$, $\rho < \frac12$. Since $\|w_{ij}\|_\infty = O(2^{i/2})$, if we let
$h_{ij} = w_{ij}(\Phi(\cdot))$, and $h_t(x) = \sum_{i,j} \lambda_{ij} h_{ij}(x) h_{ij}(t)$, then
$\|h_t\|_\infty \le M < \infty$ and

$$T = \int_0^1 Z_n^2(t)\,dt = \sum_{i,j} \lambda_{ij}^2\,[Z_n(i,j)]^2,$$

where $Z_n(i,j)$ falls under the statistics covered by Theorem 3.2.

An interesting basis to consider is the set of normalized Hermite
polynomials $h_j(x) = (-1)^j \frac{d^j \varphi(x)}{dx^j} \big/ \varphi(x)$, $j \ge 3$. Here $h_3$ and $h_4$
correspond to skewness and kurtosis so that it is attractive to make
$\lambda_3 = \lambda_4 = 1$ and $\lambda_j$ decrease rapidly further on.
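
The Hermite components can be computed directly; the sketch below is one
possible reading, not the authors' code. It uses the probabilists' Hermite
polynomials from NumPy, divides by $\sqrt{j!}$ so that each component has unit
variance under $\mathcal{N}(0,1)$, and combines the components with weights equal to
one for degrees 3 and 4 and geometrically decaying afterwards. The decay rate,
the maximal degree, the use of squared weights and the function names are all
assumptions made for illustration.

```python
import math

import numpy as np
from numpy.polynomial.hermite_e import hermeval


def hermite_components(x, max_degree=8):
    """Normalized sums of Hermite polynomials of the standardized data, degrees 3 and up."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()          # standardize with the MLE's
    comps = {}
    for j in range(3, max_degree + 1):
        coef = np.zeros(j + 1)
        coef[j] = 1.0                     # selects He_j
        hj = hermeval(z, coef) / math.sqrt(math.factorial(j))
        comps[j] = float(hj.sum() / math.sqrt(z.size))
    return comps


def weighted_hermite_stat(x, max_degree=8, decay=0.5):
    """Weighted sum of squared components; weights 1 for j = 3, 4, then decaying."""
    comps = hermite_components(x, max_degree)
    total = 0.0
    for j, zj in comps.items():
        lam = 1.0 if j <= 4 else decay ** (j - 4)   # hypothetical weights
        total += (lam * zj) ** 2
    return total
```

With this normalization the degree 3 component equals $\sqrt{n}$ times the sample
skewness divided by $\sqrt{6}$, and the degree 4 component equals $\sqrt{n}$ times
the sample excess kurtosis divided by $\sqrt{24}$.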

We stress again that Propositions 4.1 and 4.2 can be applied to all these 
diverse tests. 
4.3. Independence. One of the most important semiparametric hypotheses
corresponds to $X = (U, V) \sim P$, $H : P = P_U \times P_V$, that is, $U$ and $V$ are
independent, $U, V \in R$. In this case the NPMLE of $P$ under $H$ is
$\hat P_n = P_{nU} \times P_{nV}$, where $P_{nU}$ and $P_{nV}$ are the empirical marginals of $U$
and $V$, and is known to be efficient ([3], Chapter 5). Thus

$$Z_n(\gamma) = \sqrt{n}\,\big(P_n - (P_{nU} \times P_{nV})\big)(h_\gamma)
= \sqrt{n}\left\{ \frac{1}{n}\sum_{i=1}^n h(U_i, V_i, \gamma, \hat P_n) - \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n h(U_i, V_j, \gamma, \hat P_n) \right\}.$$

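Natural $h_\gamma(u,v)$ here are of the form $h_{1\gamma}(u)h_{2\gamma}(v)$. If we take
$h_\gamma(u,v) = 1_{Q_\gamma}(u,v) = 1_{Q_{1\gamma}}(u)1_{Q_{2\gamma}}(v)$, where $Q_\gamma = Q_{1\gamma} \times Q_{2\gamma}$,
$Q_{j\gamma} = (-\infty, \gamma_j]$, $j = 1,2$, we arrive at the familiar

$$Z_n(\gamma) = \sqrt{n}\,\big(F_n(\gamma_1,\gamma_2) - F_{nU}(\gamma_1)F_{nV}(\gamma_2)\big),$$

where $F_n, F_{nU}, F_{nV}$ are the appropriate empirical d.f.'s.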

Application of Theorem 3.1 here is appropriate and easy. (M0) simply
says $\gamma \to 1((x,y) \in Q_\gamma) - F_U(\gamma_1)1(y \le \gamma_2) - F_V(\gamma_2)1(x \le \gamma_1)$ is a Donsker
class, essentially a statement about the bivariate empirical process. Since $h_\gamma$
does not depend on $\alpha$, (M3) and (M4) are immediate. Finally, (M5) is well
known for this process; see, for instance, [3].

If we take $\mu_\alpha(d\gamma) = dF_U(\gamma_1)\,dF_V(\gamma_2)$ with $\alpha = (F_U, F_V)$ we obtain the
Kiefer-Wolfowitz statistic

$$T = n \int\!\!\int \big(F_n(\gamma_1,\gamma_2) - F_{nU}(\gamma_1)F_{nV}(\gamma_2)\big)^2\,dF_{nU}(\gamma_1)\,dF_{nV}(\gamma_2).$$

If we take $T = \sup_\gamma |Z_n(\gamma_1,\gamma_2)|$ we obtain the Kolmogorov-Smirnov version
of the Kiefer-Wolfowitz statistic.

Invariance under monotone transformation of $H$ suggests

$$h(u, v, \gamma, P_U \times P_V) = 1_{Q_\gamma}(F_U(u), F_V(v)), \tag{13}$$

where $F_U, F_V$ are the c.d.f.'s of $U, V$, and leads to $Z_n(\gamma)$, a linear rank test
statistic,

$$Z_n(\gamma) = n^{-1/2}\sum_{i=1}^n h_\gamma\!\left(\frac{R_i}{n}, \frac{S_i}{n}\right) - n^{-3/2}\sum_{i=1}^n\sum_{j=1}^n h_\gamma\!\left(\frac{i}{n}, \frac{j}{n}\right),$$

where $R_i$ is the rank of $U_i$ among the $U$'s and $S_i$ is the rank of $V_i$ among the
$V$'s. These are the building blocks of the Kallenberg-Ledwina [18] statistics,
though the ones they propose are of non-$n^{-1/2}$-consistent type. We leave
it to the reader to construct tests with power against all $n^{-1/2}$ alternatives
and directions that (s)he prefers.
Proposition 4.3. Suppose $h_\gamma$ is given by (13), $\hat\alpha$ is the NPMLE as
specified and $\alpha_0$ has continuous marginals. Then (M0), (M3), (M4) and
(M5) hold.

PROOF. In this case

$$\Pi(h,\alpha)(x,y) = \int h(x,v)\,dF_V(v) + \int h(u,y)\,dF_U(u) - 2\int\!\!\int h(u,v)\,dF_U(u)\,dF_V(v),$$

where $F_U, F_V \leftrightarrow \alpha$. So (M5) can be written

$$\sup_\gamma \left| \int\!\!\int h_\gamma(x,y)\,d(F_{nU}(x) - F_{0U}(x))\,d(F_{nV}(y) - F_{0V}(y)) \right| = o_P(n^{-1/2}), \tag{14}$$

where $F_{nU}$ is the empirical d.f. of $U$, $F_{0U}$ corresponds to $\alpha_0$, and so on. For
this condition and all subsequent ones, we can assume, w.l.o.g., that $\alpha_0$ is
the uniform distribution on the unit square by making separate probability
integral transforms. But (14) is just

$$\sup_{0 \le \gamma_1 \le 1} |F_{nU}(\gamma_1) - \gamma_1| \, \sup_{0 \le \gamma_2 \le 1} |F_{nV}(\gamma_2) - \gamma_2| = O_P(n^{-1}),$$

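and (M5) follows. (M0) has been discussed in connection with $h_\gamma$. For (M3)
write

$$(P_n - P_0)\big(h_\gamma(\cdot,\hat\alpha) - h_\gamma(\cdot,\alpha_0)\big)
= (P_n - P_0)\Big(1\big((u,v) \le (\hat F_U^{-1}(\gamma_1), \hat F_V^{-1}(\gamma_2))\big) - 1\big((u,v) \le (\gamma_1,\gamma_2)\big)\Big),$$

where $(\hat F_U, \hat F_V) \leftrightarrow \hat\alpha$. Now, by Glivenko-Cantelli,
$\sup|\hat F_U^{-1}(\gamma_1) - \gamma_1| \to 0$ and $\sup|\hat F_V^{-1}(\gamma_2) - \gamma_2| \to 0$. So (M3) follows
from the weak convergence of $n^{1/2}(P_n - P_0)(h_\gamma)$. (M4) is argued similarly. □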

Application of the type I bootstrap is straightforward when $F_U$ and $F_V$
are continuous under $\alpha_0$: in view of the invariance, we need only simulation
under the uniform distribution on the unit square. Resampling from the
empirical (type II) is also possible but the argument is more delicate.
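
To make the invariance argument concrete, here is a minimal sketch (an
illustration, not the authors' code) of the Kolmogorov-Smirnov and
Cramer-von Mises versions of the independence statistic, together with a
null distribution simulated from the uniform distribution on the unit square.
The function names and simulation settings are assumptions, and the sketch
assumes there are no ties, as is the case almost surely for continuous
marginals.

```python
import numpy as np


def independence_stats(u, v):
    """KS and CvM versions of sqrt(n) * (F_n - F_nU * F_nV), assuming no ties."""
    u = np.asarray(u)
    v = np.asarray(v)
    n = u.size
    r = np.argsort(np.argsort(u))        # 0-based ranks of the U's
    s = np.argsort(np.argsort(v))        # 0-based ranks of the V's
    counts = np.zeros((n, n))
    counts[r, s] = 1.0
    # Joint empirical d.f. evaluated on the grid of order statistics.
    fn = counts.cumsum(axis=0).cumsum(axis=1) / n
    grid = np.arange(1, n + 1) / n       # marginal empirical d.f.'s on that grid
    diff = fn - np.outer(grid, grid)
    ks = np.sqrt(n) * np.abs(diff).max()
    cvm = n * np.mean(diff ** 2)         # integral w.r.t. dF_nU dF_nV
    return ks, cvm


def null_distribution(n, n_sim=500, seed=0):
    """Simulated null values of (ks, cvm) under the uniform on the unit square."""
    rng = np.random.default_rng(seed)
    return np.array([independence_stats(rng.random(n), rng.random(n))
                     for _ in range(n_sim)])
```

Since the statistics depend on the data only through the two rank vectors,
they are unchanged by strictly increasing transformations of either
coordinate, which is why a single simulated table per sample size is enough.
The double cumulative sum uses $O(n^2)$ memory, which is acceptable for
moderate $n$.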

This result can easily be extended to more general $h_{1\gamma}(u)h_{2\gamma}(v)$ and we
can also tailor tests here. For instance, consider the tensor wavelet basis
on $[0,1] \times [0,1]$, $\{h_{ij_1j_2}\}$ where $i$ corresponds to scale and $(j_1,j_2)$ to the
location. We can again suppose that departures from independence at lower
resolution are more significant and proceed as in Section 4.1 to form

$$T = n \int_{I^2} \left[ \sum_{i,j_1,j_2} \lambda_{ij_1j_2} \int h_{ij_1j_2}\big(F_{nU}(x), F_{nV}(y)\big)\,d(P_n - \hat P_n)(x,y)\; h_{ij_1j_2}(u,v) \right]^2 du\,dv,$$

where $I^2$ is the unit square and $\hat P_n = P_{nU} \times P_{nV}$. The $\lambda$ can be chosen so
as to weight the lower-resolution terms as one pleases.

4.4. Copula models. The standard copula model is $X = (U, V)$, $U, V \in R$
as above, where for some monotone strictly increasing transformations
$a(\cdot) : R \to R$, $b(\cdot) : R \to R$ the vector $(a(U), b(V)) \sim G_\theta$, a regular
parametric model. A natural model to consider here is the bivariate Gaussian
copula, where under $\theta$, $X$ has standard normal marginals with correlation
$\theta$, $-1 < \theta < 1$. (Assuming unknown means and variances adds nothing since
making $a$ and $b$ arbitrary makes these parameters unidentifiable.) In such a
model we consider two problems:

(i) $H : P \in \mathcal{P}_0 = \{P_{\theta,a,b} : \theta \in \Theta,\ a, b \text{ general}\}$, the copula model
hypothesis.

(ii) $H : \theta = \theta_0$, $K : \theta \ne \theta_0$ within $\mathcal{P}_0$.

The first hypothesis requires use of efficient estimates of $(a, b, \theta)$. These
are in general difficult to construct. Inefficient estimates are readily
computable, but application of Theorem 3.1 requires computation of $\Pi(\cdot,\alpha)$
which can be characterized by Sturm-Liouville equations and computed
numerically (see [3], pages 172-175). We do not pursue this interesting special
case further.

On the other hand, by our assumptions in Section 3, tests of (ii) are
naturally based on $h_\gamma(\cdot,\alpha) = h_\gamma(F_U(\cdot), F_V(\cdot))$, where the
$\{h_\gamma(x,y), \gamma \in R^p\}$ are scores of the parametric model $\{P_\theta : \theta \in \Theta\}$ at
$\theta = \theta_0$. If efficient estimates $\hat F_U$, $\hat F_V$ under $H$ are used, then $Z_n(\gamma)$ is the
asymptotically most powerful score test in direction $\gamma$. Otherwise, if, say,
we use the empirical d.f.'s $F_{nU}$ and $F_{nV}$, we can construct $\Pi(\cdot,\gamma)$, the
projection on the tangent space of $\mathcal{P}_{\theta_0} = \{P_{(\theta_0,a,b)} : a, b \text{ arbitrary}\}$ and
use $Z_n$. Finally, if efficient estimates of $\theta$ under $\mathcal{P}_0$ are available, these
can be used in the obvious way, though in general such estimates will be
difficult to obtain. These hypotheses of finite codimension are the subject of
Choi, Hall and Schick [8]. If we specialize to the Gaussian copula model and
consider $H : \rho = 0$, the independence hypothesis, it is easy to see that the
single asymptotically most powerful test is to use the normal score rank
statistic
The reason here is that $F_{nU}$, $F_{nV}$ are efficient in this case, as we have
already seen. Remarkably, Klaassen and Wellner [20] show that $T$ is the
asymptotically most powerful score statistic for $H : \rho = \rho_0$, any $\rho_0$, by
showing effectively that

$$T = \frac{1}{\sqrt{n}}\sum_{i=1}^n h\big(\Phi^{-1}F_U(U_i), \Phi^{-1}F_V(V_i), \rho_0\big) + o_P(1),$$

where $h(\Phi^{-1}F_U(U_i), \Phi^{-1}F_V(V_i), \rho_0)$ is orthogonal to the tangent space
$\dot{\mathcal{P}}_2(\rho_0, (a_0, b_0))$, $a_0 = \Phi^{-1}F_U$, $b_0 = \Phi^{-1}F_V$.

The development of tailored tests for independence in copula models in 
general should be the same as for independence in general. 
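
For the Gaussian copula independence hypothesis, a normal score rank
statistic can be sketched as below. This is an illustration under stated
assumptions rather than the exact statistic of the text: the scores
$\Phi^{-1}(R_i/(n+1))$, the $n^{-1/2}$ normalization and the function name are
choices made for the example.

```python
import numpy as np
from scipy.stats import norm, rankdata


def normal_scores_stat(u, v):
    """n^{-1/2} times the sum of products of normal scores of the two rank vectors."""
    n = len(u)
    # Dividing ranks by n + 1 (an assumption of this sketch) keeps the
    # arguments of the normal quantile function strictly inside (0, 1).
    a = norm.ppf(rankdata(u) / (n + 1))
    b = norm.ppf(rankdata(v) / (n + 1))
    return float(np.sum(a * b) / np.sqrt(n))
```

The statistic depends on the data only through the ranks, so it is invariant
under strictly increasing transformations of either coordinate, in line with
the structure of the copula model.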

APPENDIX 
Proof of Theorem 3.2. We need

Lemma A.1. If (N1) and (N2) hold, then so does (M2),

$$\sup\{|P_{\hat\alpha}(h) - P_0(h) + P_0(\Pi(h,\hat\alpha))| : h \in \mathcal{H}\} = o_P(n^{-1/2}).$$

Proof. By (N1) we may w.l.o.g. assume $\hat\alpha \in \mathcal{A}_0$. For simplicity let
$\mu = P_0$ so that $s_0 = 1$. Obvious modifications suffice if $\mu \gg P_0$. Then

$$P_{\hat\alpha}(h) - P_0(h) = \int h s^2\,d\mu - \int h\,d\mu
= 2\int h(s-1)\,d\mu + \int h(s-1)^2\,d\mu. \tag{A.1}$$
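
The step from the first to the second expression in (A.1) rests on an
elementary identity. As a reading aid, and writing $s$ for the quantity with
$P_{\hat\alpha}(h) = \int h s^2\,d\mu$ as in the display above, it can be spelled out as
follows.

```latex
s^2 - 1 = (s-1)(s+1) = (s-1)\bigl(2 + (s-1)\bigr) = 2(s-1) + (s-1)^2,
\qquad\text{so}\qquad
\int h\,(s^2-1)\,d\mu = 2\int h(s-1)\,d\mu + \int h(s-1)^2\,d\mu .
```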
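Since $P_{\hat\alpha}\Pi(h,\hat\alpha) = 0$,

$$P_0(\Pi(h,\hat\alpha)) = -\int \Pi(h,\hat\alpha)(s^2 - 1)\,d\mu
= -2\int \Pi(h,\hat\alpha)(s-1)\,d\mu - \int \Pi(h,\hat\alpha)(s-1)^2\,d\mu. \tag{A.2}$$

But, since $P_{\hat\alpha} \ll P_0$ we have by [3], formula (4b), page 50,

$$\Pi(h,\hat\alpha) = s^{-1}\,\Pi(hs, \alpha_0).$$

Therefore, if $h \in L_2(\mu)$,

$$\int \Pi(h,\hat\alpha)(s-1)\,d\mu = \int \Pi(hs,\alpha_0)\,\frac{s-1}{s}\,d\mu
= \int hs\,\Pi\!\left(\frac{s-1}{s}, \alpha_0\right) d\mu.$$

Substituting in (A.2) we get after some manipulation

$$\begin{aligned}
P_0(\Pi(h,\hat\alpha)) &= -2\int h\,\Pi\!\left(\frac{s-1}{s},\alpha_0\right) s\,d\mu - \int \Pi(h,\hat\alpha)(s-1)^2\,d\mu \\
&= -2\left\{ \int h\,\Pi(s-1,\alpha_0)\,d\mu + \int h(s-1)\,\Pi(s-1,\alpha_0)\,d\mu - \int hs\,\Pi\!\left(\frac{(s-1)^2}{s},\alpha_0\right) d\mu \right\} \\
&\qquad - \int \Pi(h,\hat\alpha)(s-1)^2\,d\mu \\
&= -2(I + II + III) - IV.
\end{aligned} \tag{A.3}$$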
We bound the last three terms in absolute value by

$$|II| \le M\int |s-1|\,|\Pi(s-1,\alpha_0)|\,d\mu \le M\|s-1\|_2^2 \le M\left\|\frac{(s-1)^2}{s}\right\|_2,$$

$$|III| \le M\left\|\frac{(s-1)^2}{s}\right\|_2,$$

where $M = \sup_{\mathcal{H}}\{\|h\|_\infty + \|\Pi(h,\alpha_0)\|_\infty\} < \infty$ by (N2).
Again, using (N2),

$$|IV| \le M\|s-1\|_2^2 \le M\left\|\frac{(s-1)^2}{s}\right\|_2.$$
Combining (A.1)-(A.3) we obtain

$$P_{\hat\alpha}(h) - P_0(h) + P_0(\Pi(h,\hat\alpha)) = 2\int h\big((s-1) - \Pi(s-1,\alpha_0)\big)\,d\mu + O\!\left(\left\|\frac{(s-1)^2}{s}\right\|_2\right) = o_P(n^{-1/2})$$

by (N1) and the lemma follows. □

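Proof of Theorem 3.2. Write $\hat h$ for $h_\gamma(\cdot,\hat\alpha)$ and $h_0$ for $h_\gamma(\cdot,\alpha_0)$. Then,

$$\begin{aligned}
Z_n(\gamma) &= n^{1/2}\big(P_n\hat h - P_{\hat\alpha}\hat h - P_n\Pi(\hat h,\hat\alpha)\big) \\
&= n^{1/2}\big\{(P_n - P_0)(\hat h) - (P_{\hat\alpha} - P_0)(\hat h) - P_n\Pi(\hat h,\hat\alpha)\big\} \\
&= n^{1/2}(P_n - P_0)\big(\hat h - \Pi(\hat h,\hat\alpha)\big) + o_P(1)
\end{aligned}$$

by (M2). But

$$\begin{aligned}
n^{1/2}(P_n - P_0)\big(\hat h - \Pi(\hat h,\hat\alpha)\big) &= n^{1/2}(P_n - P_0)\big(h_0 - \Pi(h_0,\alpha_0)\big) + o_P(1) \\
&= Z_n(\gamma,\alpha_0) + o_P(1)
\end{aligned}$$

by (M3) and (M1). (The equivalences hold uniformly in $\gamma$ by assumption.)
□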